AVA-AVD: Audio-Visual Speaker Diarization in the Wild
Audio-visual speaker diarization aims at detecting "who spoke when" using
both auditory and visual signals. Existing audio-visual diarization datasets
are mainly focused on indoor environments like meeting rooms or news studios,
which are quite different from in-the-wild videos in many scenarios such as
movies, documentaries, and audience sitcoms. To develop diarization methods for
these challenging videos, we create the AVA Audio-Visual Diarization (AVA-AVD)
dataset. Our experiments demonstrate that adding AVA-AVD to the training set
produces significantly better diarization models for in-the-wild videos, even
though the dataset is relatively small. Moreover, this benchmark is challenging due
to the diverse scenes, complicated acoustic conditions, and completely
off-screen speakers. As a first step towards addressing the challenges, we
design the Audio-Visual Relation Network (AVR-Net) which introduces a simple
yet effective modality mask to capture discriminative information based on face
visibility. Experiments show that our method not only outperforms
state-of-the-art methods but is also more robust as the ratio of off-screen
speakers varies. Our data and code have been made publicly available at
https://github.com/showlab/AVA-AVD.
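As a rough illustration of the modality-mask idea described above, the sketch below (hypothetical PyTorch, not the released AVR-Net implementation; all module and variable names are assumptions) zeroes out the visual branch whenever no face is visible, so the fused embedding falls back to an audio-only representation:

import torch
import torch.nn as nn

class MaskedAudioVisualFusion(nn.Module):
    """Fuse audio and face embeddings into one speaker embedding,
    masking the visual branch when the speaker's face is off screen."""

    def __init__(self, audio_dim=256, face_dim=256, out_dim=256):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, out_dim)
        self.face_proj = nn.Linear(face_dim, out_dim)
        self.fuse = nn.Linear(2 * out_dim, out_dim)

    def forward(self, audio_emb, face_emb, face_visible):
        # face_visible: (batch,) float tensor, 1.0 if a face track exists, else 0.0
        a = self.audio_proj(audio_emb)
        # Zero out the visual branch for off-screen speakers so the fused
        # embedding degrades gracefully to audio-only information.
        v = self.face_proj(face_emb) * face_visible.unsqueeze(-1)
        return self.fuse(torch.cat([a, v], dim=-1))

# Pairwise same-speaker scoring via cosine similarity of fused embeddings.
fusion = MaskedAudioVisualFusion()
emb = fusion(torch.randn(4, 256), torch.randn(4, 256),
             torch.tensor([1.0, 0.0, 1.0, 0.0]))
same_speaker_score = torch.nn.functional.cosine_similarity(emb[0], emb[1], dim=0)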
PV3D: A 3D Generative Model for Portrait Video Generation
Recent advances in generative adversarial networks (GANs) have demonstrated
the capabilities of generating stunning photo-realistic portrait images. While
some prior works have applied such image GANs to unconditional 2D portrait
video generation and static 3D portrait synthesis, there are few works
successfully extending GANs for generating 3D-aware portrait videos. In this
work, we propose PV3D, the first generative framework that can synthesize
multi-view consistent portrait videos. Specifically, our method extends the
recent static 3D-aware image GAN to the video domain by generalizing the 3D
implicit neural representation to model the spatio-temporal space. To introduce
motion dynamics to the generation process, we develop a motion generator by
stacking multiple motion layers to generate motion features via modulated
convolution. To alleviate motion ambiguities caused by camera/human motions, we
propose a simple yet effective camera condition strategy for PV3D, enabling
both temporal and multi-view consistent video generation. Moreover, PV3D
introduces two discriminators for regularizing the spatial and temporal domains
to ensure the plausibility of the generated portrait videos. Together, these
designs enable PV3D to generate 3D-aware, motion-plausible portrait videos with
high-quality appearance and geometry, significantly outperforming prior works.
As a result, PV3D is able to support many downstream applications such as
animating static portraits and view-consistent video motion editing. Code and
models will be released at https://showlab.github.io/pv3d
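The description of the motion generator above can be made concrete with a much-simplified sketch (hypothetical PyTorch, not the released PV3D code; demodulation and the 3D-aware generator are omitted, and every name here is an illustrative assumption): each motion layer scales its input features with a per-sample motion code before a temporal convolution, and several such layers are stacked to produce motion features.

import torch
import torch.nn as nn

class MotionLayer(nn.Module):
    """One motion layer: a 1D convolution over time whose input is
    modulated per-sample by a latent motion code (demodulation omitted)."""

    def __init__(self, channels=64, motion_dim=128):
        super().__init__()
        self.to_scale = nn.Linear(motion_dim, channels)  # motion code -> per-channel scale
        self.conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.act = nn.LeakyReLU(0.2)

    def forward(self, feat, motion_code):
        # feat: (batch, channels, time); motion_code: (batch, motion_dim)
        scale = self.to_scale(motion_code).unsqueeze(-1)  # (batch, channels, 1)
        return self.act(self.conv(feat * scale))          # modulate, then convolve

# Stack several motion layers to turn a latent motion code into temporal features.
layers = nn.ModuleList([MotionLayer() for _ in range(4)])
feat = torch.randn(2, 64, 16)   # e.g. features for 16 frames
z_motion = torch.randn(2, 128)  # latent motion code
for layer in layers:
    feat = layer(feat, z_motion)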
HOSNeRF: Dynamic Human-Object-Scene Neural Radiance Fields from a Single Video
We introduce HOSNeRF, a novel 360° free-viewpoint rendering method that
reconstructs neural radiance fields for a dynamic human-object-scene from a
single monocular in-the-wild video. Our method enables pausing the video at any
frame and rendering all scene details (dynamic humans, objects, and
backgrounds) from arbitrary viewpoints. The first challenge in this task is the
complex object motions in human-object interactions, which we tackle by
introducing new object bones into the conventional human skeleton hierarchy
to effectively estimate large object deformations in our dynamic human-object
model. The second challenge is that humans interact with different objects at
different times, for which we introduce two new learnable object state
embeddings that can be used as conditions for learning our human-object
representation and scene representation, respectively. Extensive experiments
show that HOSNeRF outperforms state-of-the-art approaches on two challenging
datasets by a large margin of 40%-50% in terms of LPIPS. The code, data, and
compelling examples of 360° free-viewpoint renderings from single videos
will be released at https://showlab.github.io/HOSNeRF.
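As a loose illustration of the learnable object state embeddings, the sketch below (hypothetical PyTorch, not the HOSNeRF release; names, dimensions, and the tiny MLP are assumptions) conditions a NeRF-style MLP on an embedding chosen by the object the human is interacting with at the current frame:

import torch
import torch.nn as nn

class StateConditionedRadianceField(nn.Module):
    """Tiny NeRF-style MLP whose input points are concatenated with a
    learnable embedding of the current human-object interaction state."""

    def __init__(self, num_states=4, state_dim=16, pos_dim=3, hidden=128):
        super().__init__()
        self.state_emb = nn.Embedding(num_states, state_dim)  # one vector per object state
        self.mlp = nn.Sequential(
            nn.Linear(pos_dim + state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),                              # RGB + density
        )

    def forward(self, points, state_id):
        # points: (num_rays, num_samples, 3); state_id: index of the object
        # the human interacts with at this frame.
        e = self.state_emb(state_id).expand(*points.shape[:-1], -1)
        return self.mlp(torch.cat([points, e], dim=-1))

field = StateConditionedRadianceField()
pts = torch.rand(1024, 64, 3)              # sampled 3D points along rays
rgb_sigma = field(pts, torch.tensor(1))    # query with the frame's object state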